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Abstract 

Since Hirsch proposed the so-called h-index, it has prompted extensive discussion in the bibliometrics literature. Ranking an author's papers by their citations, this index characterizes a researcher by the largest number h of papers that are each cited at least h times. A closed formula for the h-index distribution that can be applied to distinct databases is not yet known. In fact, obtaining such a distribution requires knowledge of the citation distribution of the authors and its specificities. Instead of dealing with randomly chosen researchers, here we address different groups based on distinct databases. The first group is composed of physicists and biologists, with data extracted from the Institute of Scientific Information (ISI). The second group is composed of computer scientists, whose data were extracted from the Google Scholar system. In this paper, we obtain a general formula for the h-index probability density function (pdf) for groups of authors by using generalized exponentials in the context of escort probability. Our analysis includes the use of several statistical methods to estimate the necessary parameters. We also make an exhaustive comparison among the candidate distributions used to describe the way citations are distributed among authors. The h-index pdf can be used to classify groups of researchers from a quantitative point of view, which is of interest for replacing obscure qualitative methods.
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1. Introduction 

The scientific community has not been the same since the publication of the controversial Hirsch index (h-index). An author has h-index equal to h if she has h papers with at least h citations each, but does not have h + 1 papers with at least h + 1 citations each [1]. This definition leads to an important fact: an author with index h has at least $h^2$ citations [1], a lower bound on the citation number.

14 November 2011
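The definition above can be made concrete with a short sketch. The function name `h_index` and the citation counts are illustrative assumptions, not data from this study; the last line only checks the $h^2$ lower bound on the total number of citations.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)   # most cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank                           # the top `rank` papers all have >= rank citations
        else:
            break
    return h

# Hypothetical citation counts for a single author (illustrative only).
example = [10, 8, 5, 4, 3, 0]
h = h_index(example)
print(h, sum(example), sum(example) >= h ** 2)
```

Sorting once and scanning the ranked list keeps the computation at the cost of a sort, which is enough for the group sizes considered here.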

This simple but powerful idea has also prompted other informetric formulations [2]. The index combines attributes such as productivity, quality and homogeneity in a single measure. It has also been used as one of the most important measures to quantify scientists in order to obtain fairer rankings [3]. Fairer rankings should mean, for example, that grants are distributed among scientists on the basis of actual capability, which in turn should stimulate healthy competition.

Other successful variations of the h-index have been proposed. For instance, dividing the h-index by the average number of authors in the h top papers leads to $h_I = h^2/N_t$ [4], where $N_t$ is the total number of authors in these h papers. Considering that a massive part of an author's publications does not enter the computation of the h-index, an interesting alternative index that takes into account the weight of this mass of "lazy" papers has been proposed in Refs. [5-7]. Other metrics consider not only an individual h-index for each scientist but also its version for a group of researchers, which is known as the successive h-index [8,9].

Laherrère and Sornette [10] were the first to address the researcher citation distribution. They ranked 1120 physicists according to their total number of citations. The number of researchers N(x) as a function of their citation number x follows a stretched exponential function:

$$N(x) = N_0 \exp[-(x/x_0)^{\beta}] \qquad (1)$$

with $\beta \approx 0.3$. Here $N_0 = N(0)$ is the number of authors with no citations and $x_0$ is a parameter that can be estimated, for example, if the mean citation number is known. Of course, the citation number x is an integer variable, but here we treat it as a continuous variable.

Alternatively, Redner [11] addressed this question in a slightly different way. He studied the probability distribution of citations of 783,339 scientific papers (not authors) published in 1981, with the 6,716,198 citations they received between 1981 and 1997 in the Institute of Scientific Information (ISI) database. The envelope of this distribution presents a stretched exponential behavior for low citation numbers, $x < x_c$ with $x_c = 200$, while for large citation numbers, $x > x_c$, a power law behavior is dominant, $N(x) \sim x^{-\alpha}$ with $\alpha \approx 3.0$.

Tsallis and Albuquerque [12] observed that Redner's probability distribution function (pdf), i.e., the paper (not author) citation pdf, could be better described by a generalized exponential distribution that covers the two regimes ($x < x_c$ and $x > x_c$):

$$N_q(x) = N_0 [\exp_q(-\lambda x)]^{q} \qquad (2)$$

where the generalized exponential function (q-exponential) is $\exp_q(x) = [1 + (1-q)x]^{1/(1-q)}$ if $(1-q)x > -1$, and it vanishes otherwise. The inverse of the q-exponential is the q-logarithm, $\ln_q(x) = (x^{1-q} - 1)/(1-q)$ [13-15]. The parameter $\lambda$ in Eq. (2) is obtained by constraining the average number of citations per paper to be a constant, $\langle x \rangle = \int_0^\infty dx\, x\, P_q(x) = \text{constant}$, where $P_q(x) = [\exp_q(-\lambda x)]^q / \int_0^\infty dy\, [\exp_q(-\lambda y)]^q$. Although in Ref. [12] this approach was used only for the distribution of scientific papers, we show that it can also be applied to the author citation probability distribution.
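As a minimal sketch of the two functions just defined, the q-exponential and the q-logarithm can be written as below. The function names `exp_q` and `ln_q` are illustrative choices, and the q = 1 branch (ordinary exponential and logarithm) is the standard limiting case rather than something stated in the text.

```python
import math

def exp_q(x, q):
    """q-exponential: [1 + (1 - q) x]^(1/(1 - q)) if 1 + (1 - q) x > 0, else 0."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)              # ordinary exponential in the limit q -> 1
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        return 0.0                      # the function vanishes outside its support
    return base ** (1.0 / (1.0 - q))

def ln_q(x, q):
    """q-logarithm: (x^(1 - q) - 1) / (1 - q), defined for x > 0."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

# Round trip: ln_q undoes exp_q wherever exp_q is positive.
print(ln_q(exp_q(-2.0, 1.3), 1.3))
```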

Here, we consider the stretched and generalized exponential pdf's to describe the envelope of the author citation distribution. We also ask which pdf is more appropriate to describe the h-index distribution of authors in distinct groups of researchers. Using a continuous approximation for the citation distributions, we study two different researcher groups collected from two different databases. The first comes from Graduate Programs in Physics and Biology of public universities in Brazil, using their ISI publication registry. The other comes from the software engineering area, where we computed the h-indices and the total citations of the members of program committees of different conferences. It is important to stress that our purpose is to show that the same h-index pdf is verified even in "soft" databases, which are more suitable to areas that are not based strictly on journal publications. In statistical physics, if different models follow the same law, one says that this law is universal; in fact, there are universality classes. We ask whether the two groups of authors, from different areas and with data collected from different databases, have the same pdf, and which pdf better describes the data.

This paper is organized as follows. In Sec. 2, we show the details of the derivation of the h-index distribution from the citation distribution under two approaches: i) the citation distribution follows an escort probability distribution according to Eq. (2); ii) citations are distributed according to a stretched exponential, Eq. (1). In Sec. 3, we present the details of the databases used to test the h-index distribution formulas and describe how the data were extracted for each studied group. In Sec. 4, we present our results, which can be separated into two distinct parts. In the first (preliminary) part, we estimate the necessary parameters of the citation distributions i) and ii) by using the method of moments and by comparing the theoretical and empirical cumulative citation distributions. In the second part, we obtain the h-index distribution for the distinct areas. The parameters obtained from the h-index distribution fits are compared with the same parameters estimated from the citation distribution fits, and we conclude that the generalized statistics produce an h-index distribution that corroborates the citation distribution, whereas the stretched exponential shows more pronounced differences between these two ways of estimation. Finally, in Sec. 5 we summarize our results and highlight our main contributions.



2. h-index probability density functions for groups of authors

In this section, we consider the continuous limit of the normalized citation distribution and derive the distribution of h-indices for a group of authors for the stretched [10] and generalized [12] exponential pdf's, which are compared in Sec. 4.

The fundamental Hirsch hypothesis [1] leads to $x = c h^2$, where c is a constant determined by a suitable linear fit. Since the databases supply the number of citations ($x_i$) and the corresponding h-index ($h_i$) of each author, with $i = 1, \ldots, n$, one has estimators for the mean citation number $\langle x \rangle$ and for c. Collecting the sample of citations of all authors of a group, denoted by $x_1, x_2, \ldots, x_n$ for the respective n authors, the estimate of $\langle x \rangle$, denoted by $\bar{x}$, is the simple arithmetic mean:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \qquad (3)$$

Similarly, an estimator $\hat{c}$ for the coefficient c comes from least-squares fitting,

[Eq. (4): least-squares expression for $\hat{c}$; not legible in this copy]

which measures the slope in the linear fit of the plot of x versus $h^2$.
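To illustrate the two estimators, the sketch below computes the arithmetic mean of Eq. (3) and a least-squares slope for $x = c h^2$. It assumes a fit constrained to pass through the origin, whose slope is $\sum_i x_i h_i^2 / \sum_i h_i^4$; whether this coincides with the estimator of Eq. (4) is an assumption, and the function names and sample values are hypothetical.

```python
def mean_citations(x):
    """Eq. (3): arithmetic mean of the citation numbers."""
    return sum(x) / len(x)

def slope_through_origin(x, h):
    """Least-squares slope c of x = c * h^2 with no intercept (assumed form)."""
    numerator = sum(xi * hi ** 2 for xi, hi in zip(x, h))
    denominator = sum(hi ** 4 for hi in h)
    return numerator / denominator

# Hypothetical authors whose citations follow x = 4 h^2 exactly.
h_values = [1, 2, 3, 5]
x_values = [4, 16, 36, 100]
print(mean_citations(x_values), slope_through_origin(x_values, h_values))
```

With real data the points scatter around the line, so the fitted slope depends on whether an intercept is allowed; the no-intercept form is used here only because the Hirsch relation itself has none.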



2.1. Stretched Exponential PDF 

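One interesting pdf used to study the citation distribution is the stretched exponential pdf, which we define in the continuous case as:

$$P_\beta(x) = \frac{\beta}{x_0 \Gamma(1/\beta)} \exp[-(x/x_0)^{\beta}] \qquad (5)$$

One can estimate $x_0$ by calculating $\langle x \rangle$, which, in turn, is estimated by $\bar{x}$, i.e., we impose the condition $\langle x \rangle = \int_0^\infty dx\, x\, P_\beta(x) = x_0 \Gamma(2/\beta)/\Gamma(1/\beta) = \bar{x}$, resulting in:

$$x_0 = \frac{\Gamma(1/\beta)}{\Gamma(2/\beta)}\, \bar{x} \qquad (6)$$

According to $x = c h^2$, the h-index pdf follows from the change of variables $H_\beta(h) = P_\beta(c h^2)\, dx/dh$ with $dx/dh = 2ch$, so that for the stretched exponential:

$$H_\beta(h) = \frac{2 c \beta\, \Gamma(2/\beta)}{\Gamma(1/\beta)^2\, \bar{x}}\, h\, \exp\left\{-\left[\frac{c\, \Gamma(2/\beta)\, h^2}{\Gamma(1/\beta)\, \bar{x}}\right]^{\beta}\right\} \qquad (7)$$

In Sec. 4 we show fits of the h-index histograms for the distinct analyzed groups using this pdf. Now let us obtain another formula for the h-index pdf by using the proposal of generalized exponentials.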
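A quick numerical check of the stretched exponential h-index pdf is sketched below: it evaluates Eq. (5) at $x = c h^2$, multiplies by the Jacobian $2ch$, and integrates with a simple trapezoidal rule. The function names, the integration range and the step count are illustrative assumptions; the parameter values are the Group I estimates quoted in Sec. 4.

```python
import math

def stretched_h_pdf(h, beta, c, x_mean):
    """h-index pdf for the stretched exponential, given beta, slope c and mean citations."""
    g1 = math.gamma(1.0 / beta)
    g2 = math.gamma(2.0 / beta)
    x0 = g1 * x_mean / g2                                    # Eq. (6)
    density_x = beta / (x0 * g1) * math.exp(-((c * h * h / x0) ** beta))  # Eq. (5) at x = c h^2
    return density_x * 2.0 * c * h                           # Jacobian dx/dh = 2 c h

def trapezoid(f, a, b, steps):
    """Composite trapezoidal rule on [a, b]."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width

# Group I estimates quoted in Sec. 4; the upper limit 200 is an assumed cutoff.
beta, c, x_mean = 0.47, 3.75, 473.0
area = trapezoid(lambda h: stretched_h_pdf(h, beta, c, x_mean), 0.0, 200.0, 20000)
print(f"integral of H_beta over [0, 200] = {area:.4f}")
```

The integral should be close to one because the change of variables preserves normalization; the cutoff only needs to be large enough for the stretched exponential tail to be negligible.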

2.2. Generalized Exponential PDF 

The generalized exponential approach considers:

$$P_q(x) = \frac{[\exp_q(-\lambda x)]^q}{\int_0^\infty dy\, [\exp_q(-\lambda y)]^q} \qquad (8)$$

for $0 < x < \infty$. To estimate $\lambda$, one analytically calculates the first moment of this pdf, $\langle x \rangle = \int_0^\infty x P_q(x)\, dx = 1/[(2-q)\lambda]$, which does not diverge for $1 < q < 2$. Next, one estimates $\lambda$ through

$$\hat{\lambda} = \frac{1}{(2-q)\, \bar{x}} \qquad (9)$$

and a hybrid expression for the citation pdf, considering that $\bar{x}$ is an estimator for $\langle x \rangle$, is:

$$P_q(x) = \frac{1}{(2-q)\, \bar{x}} \left[\exp_q\left\{-\frac{x}{(2-q)\, \bar{x}}\right\}\right]^q \qquad (10)$$

Now we are able to compute the h-index distribution for a group of researchers:

$$H_q(h) = P_q(c h^2)\, \frac{dx}{dh} = \frac{2 c h}{(2-q)\, \bar{x}} \left[\exp_q\left(-\frac{c h^2}{(2-q)\, \bar{x}}\right)\right]^q \qquad (11)$$

which is normalized. Since the parameters $\langle x \rangle$ and c are estimated by $\bar{x}$ and $\hat{c}$, respectively, the only free parameter is q.

Now we have two candidates for the h-index pdf. These pdf's are used to fit our data for the two different groups of researchers.
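The generalized exponential candidate can be checked in the same spirit. The sketch below evaluates the h-index pdf built from the q-exponential with $\lambda = 1/[(2-q)\bar{x}]$ and verifies numerically that it integrates to approximately one. Function names, the cutoff and the step count are assumptions made for illustration; the parameter values are the Group I estimates quoted in Sec. 4.

```python
def exp_q(x, q):
    """q-exponential for q != 1; zero where 1 + (1 - q) x is not positive."""
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

def generalized_h_pdf(h, q, c, x_mean):
    """h-index pdf for the generalized exponential, valid for 1 < q < 2."""
    lam = 1.0 / ((2.0 - q) * x_mean)          # estimate of lambda from the mean
    return 2.0 * c * h * lam * exp_q(-lam * c * h * h, q) ** q

def trapezoid(f, a, b, steps):
    """Composite trapezoidal rule on [a, b]."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width

# Group I estimates quoted in Sec. 4; the upper limit 200 is an assumed cutoff.
q, c, x_mean = 1.27, 3.75, 473.0
area = trapezoid(lambda h: generalized_h_pdf(h, q, c, x_mean), 0.0, 200.0, 20000)
print(f"integral of H_q over [0, 200] = {area:.4f}")
```

Unlike the stretched exponential, this pdf has a power-law tail, so the cutoff matters more as q approaches 2.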

3. Databases 

Two different researcher groups are collected from two different databases. The first database consists of 1203 researchers from 19 Graduate Programs in Physics (600 researchers) and 26 Graduate Programs in Biology (603 researchers) of public universities in Brazil. To obtain the published papers, we used the data from the researchers' digital curricula vitae publicly available at the Lattes database^1 (http://lattes.cnpq.br/english), which is a well-established and conceptualized database [16] where most Brazilian researchers deposit their vitae. By a manual process, each researcher's publications were extracted from this platform and compared to the ISI-JCR database^2 by cross-checking the information between these two bases.

It is important to notice that the queries were refined in order to compute each author's h-index suitably. We tuned the details of the query performed on ISI-JCR until the number of papers of the queried author in this database was the same as, or as close as possible to, the number registered by the author in his or her Lattes vita. Only after this process did we compute the author's h-index.

The same process is carried out for each researcher of the studied group. This method, although manual, is an excellent filter for obtaining data on Brazilian researchers.

In many areas, such as Computer Science, authors are ranked considering not only papers published in scientific journals but also papers published in important conferences. For the sake of simplicity, we concentrate our analysis on the software engineering area by computing the h-indices and the total citations of the members of program committees of seven different conferences, a total of 600 researchers. For these authors, we used the Harzing program, which computes the h-index based on the Google Scholar index^3. We chose Google Scholar instead of ISI-JCR for computer scientists because many important conferences are not captured by the ISI-JCR database. It is important to stress that our intention is to show that the universality of the h-index pdf is verified even in "soft" databases, which are more suitable to areas that are not based strictly on journal publications. Also, the number of researchers considered in Harzing could not be larger because the system blocks access after a certain number of searches, which makes the task much slower.

^1 The Lattes database provides high-quality data on about 1.6 million researchers and about 4,000 institutions.
^2 http://apps.isiknowledge.com/
^3 http://www.harzing.com/pop.htm

4. Results 

Our main results are presented below. We refer to the researchers from the graduate programs in Physics and Biology as Group I and to those from the Computer Science conferences (Harzing/Google Scholar) as Group II. Fig. 1 shows the citation number as a function of $h^2$ for the different authors in each of the studied groups. The proportionality parameter c, given by Eq. (4), is numerically estimated. Plot (a) corresponds to data from Group I (c = 3.75(4)) and plot (b) is the analogous plot for Group II (c = 5.44(9)). These estimates of c reflect the difference between the two areas.

Fig. 1. Number of citations as a function of $h^2$ for different authors in each group, used to estimate the constant c of Hirsch's relation, numerically obtained by Eq. (4). Plot (a): members of graduate programs in Brazil, with data obtained via the Lattes platform and ISI-JCR. Plot (b): members of program committees of computer science conferences, with data obtained via Harzing/Google Scholar.

In what follows, we compute the parameters of the citation pdf's of both groups. The parameters of the stretched exponential pdf [10] and the generalized exponential pdf [12] are estimated using the method of moments.

4.1. Stretched Exponential PDF

To estimate $\beta$ in Eq. (5) accurately, we first use the method of moments and compare it with the value obtained from the Zipf plot. The method of moments consists in calculating the k-th moments of the pdf and comparing them with the experimental ones. Consider the moments of the stretched exponential pdf:

$$\langle x^k \rangle = \int_0^\infty dx\, x^k P_\beta(x) = x_0^k\, \frac{\Gamma[(k+1)/\beta]}{\Gamma(1/\beta)}$$

The experimental moments are calculated as $\overline{x^k} = (x_1^k + x_2^k + \cdots + x_n^k)/n$. The method consists in finding the $\beta$ value that best matches the numerically calculated $\langle x^k \rangle$ and $\overline{x^k}$ for several k values (not only integers). Since $\langle x^k \rangle$ depends on $x_0$, one considers the ratio

$$m_k^{(\beta)} = \frac{\langle x^k \rangle}{\langle x \rangle^k} = \frac{\Gamma^{k-1}(1/\beta)\, \Gamma[(k+1)/\beta]}{\Gamma^{k}(2/\beta)} \qquad (12)$$

to eliminate the $x_0$ dependence. The corresponding experimental ratio is

$$m_k^{(\exp)} = \frac{\overline{x^k}}{\bar{x}^k} \qquad (13)$$

From the numerical minimization of $\int (m_k^{(\beta)} - m_k^{(\exp)})^2\, dk$, using $\delta k = 0.01$, one obtains $\beta = 0.47$ for Group I and $\beta = 0.31$ for Group II. In Fig. 2, we show the theoretical moments for several $\beta$ values together with the experimental moments in the same plot, for both groups studied here.
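The moment-matching procedure of this subsection can be sketched as a grid search. The theoretical ratio is $\Gamma^{k-1}(1/\beta)\,\Gamma[(k+1)/\beta]/\Gamma^{k}(2/\beta)$ and the experimental ratio is $\overline{x^k}/\bar{x}^k$. The range of k, the grid of $\beta$ values and the function names are assumptions (the text only gives the step $\delta k = 0.01$), and the target ratios here are synthetic, generated from the theoretical formula at $\beta = 0.47$ rather than taken from the study's data.

```python
import math

def theoretical_ratio(k, beta):
    """<x^k>/<x>^k for the stretched exponential, computed via log-gamma for stability."""
    log_ratio = ((k - 1.0) * math.lgamma(1.0 / beta)
                 + math.lgamma((k + 1.0) / beta)
                 - k * math.lgamma(2.0 / beta))
    return math.exp(log_ratio)

def experimental_ratio(k, citations):
    """Sample mean of x^k divided by the k-th power of the sample mean."""
    n = len(citations)
    mean = sum(citations) / n
    return (sum(x ** k for x in citations) / n) / mean ** k

def fit_beta(target_ratios, betas):
    """Return the beta on the grid that minimizes the summed squared mismatch over k."""
    def cost(beta):
        return sum((theoretical_ratio(k, beta) - m) ** 2
                   for k, m in target_ratios.items())
    return min(betas, key=cost)

ks = [0.01 * i for i in range(1, 201)]       # k = 0.01 .. 2.00 (assumed range)
betas = [i / 100 for i in range(1, 100)]     # assumed grid 0.01 .. 0.99
targets = {k: theoretical_ratio(k, 0.47) for k in ks}   # synthetic targets
print(fit_beta(targets, betas))
```

For real data one would build the targets as `{k: experimental_ratio(k, citations) for k in ks}`; the discrete sum over k stands in for the integral over k used in the text.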

The result for Group I, even though it comes from the same database (ISI-JCR), is clearly different from that found in Ref. [10]. However, the result for Group I is closer to the exponent for the citations of papers (not of authors) from the journal Physical Review D ($\beta \approx 0.39$) and similar to that for the citations of papers from ISI ($\beta \approx 0.44$) obtained in Ref. [11], in the limit of low citations (x < 500). Although the value found for Group II corroborates the value found in Ref. [10] ($\beta \approx 0.3$), no matching was expected because this last result was obtained for the citations of the 1120 most cited authors in ISI-JCR within a time window (1981-1997), whereas our results are based on all citations over the scientific life of the considered authors.

Fig. 2. Theoretical moments based on the citation stretched exponential pdf (Eq. 12), compared with experimental moments (Eq. 13) for (a) Group I with $\beta = 0.47$ and (b) Group II with $\beta = 0.31$.

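To test the quality of our fits, we also considered suitable Zipf plots. The main idea of the Zipf plot is to rank the citations of all authors in increasing order, $x_1 \le x_2 \le \cdots \le x_n$. The upper tail distribution is expected to be:

$$Q_j = \int_{x_j}^\infty P_\beta(x)\, dx = 1 - \frac{j}{n} \qquad (14)$$

where j is the rank of citation $x_j$. For the stretched exponential, one has:

$$Q_j = \frac{\beta\, \Gamma(2/\beta)}{\bar{x}\, \Gamma(1/\beta)^2} \int_{x_j}^\infty dx\, \exp\left\{-\left[\frac{\Gamma(2/\beta)\, x}{\Gamma(1/\beta)\, \bar{x}}\right]^{\beta}\right\} = \frac{\Gamma\left(1/\beta,\, \left[\frac{\Gamma(2/\beta)\, x_j}{\Gamma(1/\beta)\, \bar{x}}\right]^{\beta}\right)}{\Gamma(1/\beta)} \qquad (15)$$

where $\Gamma(a, b) = \int_b^\infty z^{a-1} e^{-z}\, dz$ is known as the incomplete gamma function.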

In Fig. 3, we display the Zipf plots ($Q_j$ as a function of j/n), using $\beta = 0.47$ for Group I and $\beta = 0.31$ for Group II. One can see a nearly linear behavior with slope close to -1 and intercept close to 1, the expected values in both cases. However, some discrepancies are present. For the sake of comparison, we show the linear fit (red continuous line) and the exact expected behavior (dashed blue line).

Fig. 3. Zipf plots for the stretched exponential pdf for: (a) Group I and (b) Group II. The linear behavior expected from Eq. (14) is tested by plotting $Q_j$, calculated by Eq. (15), as a function of j/n.

Fig. 4. Theoretical moments based on the citation generalized exponential pdf (Eq. 16), compared with experimental moments (Eq. 13) for (a) Group I with q = 1.27 and (b) Group II with q = 1.37.



4.2. Generalized Exponential PDF

Now we consider the generalized exponential pdf. The moment method is addressed using the ratio:

$$m_k^{(q)} = \frac{\langle x^k \rangle}{\langle x \rangle^k} = \frac{1}{(2-q)\, \bar{x}^{k+1}} \int_0^\infty dx\, x^k \left[1 - \frac{(1-q)\, x}{(2-q)\, \bar{x}}\right]^{q/(1-q)} \qquad (16)$$

where only the first moment $\langle x \rangle$ is estimated, as a simple average: $\bar{x} = 473(23)$ for Group I and $\bar{x} = 936(76)$ for Group II. Similarly to the stretched exponential case, we use the same procedure: we compare $m_k^{(q)}$ with the experimental moment ratios $m_k^{(\exp)} = \overline{x^k}/\bar{x}^k$ calculated by Eq. (13). These plots are displayed in Fig. 4 and the best adjusted values are q = 1.27 for Group I and q = 1.37 for Group II.

The upper tail distribution for the generalized exponential is:

$$Q_j = \int_{x_j}^\infty P_q(x)\, dx = \frac{1}{(2-q)\, \bar{x}} \int_{x_j}^\infty \left[1 - \frac{(1-q)\, x}{(2-q)\, \bar{x}}\right]^{q/(1-q)} dx = \exp_q\left[-\frac{x_j}{(2-q)\, \bar{x}}\right]$$

Similarly to Fig. 3, we have used the values of $\bar{x}$ to plot $Q_j$ as a function of the rank (j/n), as illustrated in Fig. 5.
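Because the upper tail of the generalized exponential has the closed form $\exp_q[-x_j/((2-q)\bar{x})]$, the points of such a Zipf plot are easy to generate. The sketch below assumes that ranks are assigned in increasing order of citations, which is what makes a slope of -1 and an intercept of 1 the expected behavior; the function names and the citation counts are hypothetical.

```python
def upper_tail_q(x, q, x_mean):
    """Closed-form upper tail exp_q(-x / ((2 - q) x_mean)), valid for 1 < q < 2."""
    lam = 1.0 / ((2.0 - q) * x_mean)
    return (1.0 + (q - 1.0) * lam * x) ** (-1.0 / (q - 1.0))

def zipf_points(citations, q):
    """Return (j/n, Q_j) pairs with ranks assigned in increasing citation order."""
    ordered = sorted(citations)
    n = len(ordered)
    x_mean = sum(ordered) / n
    return [(j / n, upper_tail_q(xj, q, x_mean))
            for j, xj in enumerate(ordered, start=1)]

# Hypothetical citation totals for a handful of authors (illustrative only).
for rank_fraction, tail in zipf_points([120, 15, 300, 48, 0, 760], q=1.3):
    print(f"{rank_fraction:.3f}  {tail:.4f}")
```

A straight-line fit of the second column against the first gives the slope and intercept that the text compares with -1 and 1.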

From Figs. 3 and 5, one observes a better linear fit for the Zipf plots using the generalized exponential pdf. However, an important question is whether the values of q estimated here from the citation distribution are also obtained when we perform h-index distribution fits.

Fig. 5. Zipf plots for the generalized exponential pdf for: (a) Group I and (b) Group II.



Groups   c         $\hat{c}$   $\beta$   $\bar{x}$   q
I        3.75(4)   3.77(5)     0.47(1)   473(23)     1.27(1)
II       5.44(9)   5.71(11)    0.31(1)   936(76)     1.37(1)

Table 1. Summary of the parameters estimated by the method of moments for the citation distribution pdf, fitted as a stretched exponential and as a generalized exponential. Group I refers to researchers from graduate programs in Physics and Biology (Lattes/CNPq-ISI-JCR). Group II refers to Computer Science conferences (Harzing/Google Scholar). The coefficient c is obtained by fits of the plots of Fig. 1. The coefficient estimator $\hat{c}$ is calculated from Eq. (4). The stretched exponential parameter $\beta$ is obtained by the best matching via the method of moments shown in Fig. 2. The average estimator $\bar{x}$ is the simple arithmetic mean. The generalized exponential parameter q is similarly obtained according to Fig. 4.

4.3. h-index PDF



In Table 1, we summarize the estimated parameters for the stretched and generalized exponential pdf's using the method of moments and Zipf plots.

Let us now compare the h-index pdf's of the databases with the stretched and generalized exponential pdf's [Eqs. (7) and (11)]. Using the estimates of c and $\bar{x}$ for each group studied (see Table 1), we numerically find the $\beta$ that minimizes $\chi^2_\beta = \sum_{h=h_{min}}^{h_{max}} [f^{(\exp)}(h) - H_\beta(h)]^2$ for the stretched exponential pdf and the q that minimizes $\chi^2_q = \sum_{h=h_{min}}^{h_{max}} [f^{(\exp)}(h) - H_q(h)]^2$ for the generalized exponential pdf, where $f^{(\exp)}(h)$ is the experimental h-index frequency, $H_\beta(h)$ is computed by Eq. (7) and $H_q(h)$ by Eq. (11). Our computer code runs with q ranging from $q_{min} = 1.01$ up to $q_{max} = 1.99$ and $\beta$ from $\beta_{min} = 0.01$ up to $\beta_{max} = 0.99$, in steps of $\Delta q = \Delta\beta = 0.01$. The best fits, according to the $\chi^2$ measures, give $\beta = 0.66(1)$ and q = 1.26(1) for Group I and $\beta = 0.64(1)$ and q = 1.24(1) for Group II, using the stretched and generalized exponential pdf's, respectively. Both pdf's are depicted in Fig. 6, and the comparison of the estimated parameters with the ones of Table 1 is compiled in Table 2.

Fig. 6. h-index distribution for (a) Group I and (b) Group II, shown together with the experimental data. The blue line corresponds to the stretched exponential pdf (Eq. 7) and the red line to the generalized exponential pdf (Eq. 11).


Although one can see from Fig. 6 good fits for both pdf's, Table 2 shows that the generalized exponential pdf produces a much better match than the stretched exponential pdf between the parameter estimated from the citation distribution through the method of moments and from the h-index distribution through the $\chi^2$ method. For Group I, the generalized exponential pdf yields estimates that agree within their uncertainties, q = 1.27(1) and q = 1.26(1), showing its greater robustness when compared to the stretched exponential pdf. It is important to notice that for Group II both pdf's produce more distant estimates.

Groups   $\beta_m$   $\beta_{\chi^2}$   $q_m$     $q_{\chi^2}$
I        0.47(1)     0.66(1)            1.27(1)   1.26(1)
II       0.31(1)     0.64(1)            1.37(1)   1.24(1)

Table 2. Comparison of the stretched and generalized exponential pdf parameters estimated by the method of moments (subscript m) and by the $\chi^2$ procedure (subscript $\chi^2$). One sees that the generalized exponential pdf has a more robust estimation than the stretched exponential pdf.
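The $\chi^2$ scan over q described in this subsection can be sketched as follows. It assumes that the experimental input is a normalized histogram mapping each h value to its relative frequency; the function names are illustrative, and the frequencies used here are synthetic, generated from the generalized exponential h-index pdf at q = 1.26 with the Group I values of c and $\bar{x}$, not the study's data.

```python
def generalized_h_pdf(h, q, c, x_mean):
    """h-index pdf for the generalized exponential, valid for 1 < q < 2."""
    lam = 1.0 / ((2.0 - q) * x_mean)
    tail_base = 1.0 + (q - 1.0) * lam * c * h * h      # positive for q > 1
    return 2.0 * c * h * lam * tail_base ** (-q / (q - 1.0))

def chi_square(freq, q, c, x_mean):
    """Sum of squared differences between observed frequencies and the model pdf."""
    return sum((f - generalized_h_pdf(h, q, c, x_mean)) ** 2
               for h, f in freq.items())

def fit_q(freq, c, x_mean):
    """Scan q from 1.01 to 1.99 in steps of 0.01 and return the minimizer."""
    grid = [(100 + i) / 100 for i in range(1, 100)]
    return min(grid, key=lambda q: chi_square(freq, q, c, x_mean))

# Synthetic normalized histogram (assumption): model values at q = 1.26.
c, x_mean = 3.75, 473.0
freq = {h: generalized_h_pdf(h, 1.26, c, x_mean) for h in range(1, 60)}
print(fit_q(freq, c, x_mean))
```

The same scan applies to the stretched exponential by swapping in its pdf and a grid over $\beta$; a brute-force grid is adequate because there is a single free parameter.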

Our results indicate that the generalized exponential pdf is more appropriate to describe the h-index pdf, supplying an interesting and simple formula for the h-indices of very distinct groups in different databases. Since the data have been collected from very different sources, one can claim a universal aspect of the generalized exponential pdf in representing the continuous h-index.

5. Conclusions 



In the first part of this manuscript, we analyzed the different formulas for the citation distribution, calculating their parameters via two different methods, the method of moments and Zipf plots, for two very distinct groups of researchers belonging to different databases. In the second part, we calculated the h-index distribution for these different databases in order to find a universal formula. Our results show that good fits can be obtained for the h-index pdf using suitable estimates and the relation $x = c h^2$. It is also important to mention that we estimated the parameters $\beta$ and q in two independent ways, by the method of moments and by $\chi^2$ methods: from the citation distributions of Eqs. (5) and (10) and from the h-index distributions of Eqs. (7) and (11). Such fits produce more similar results for the q distributions.